{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _clean_headers_user_guide:\n",
    "\n",
    "Column Headers\n",
    "=============="
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "------------\n",
    "\n",
    "The function :func:`clean_headers() <dataprep.clean.clean_headers.clean_headers>` cleans column headers in a DataFrame, and standardizes them in a desired format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Column names can be converted to the following case styles via the `case` parameter:\n",
    "\n",
    "* snake: \"column_name\"\n",
    "* kebab: \"column-name\"\n",
    "* camel: \"columnName\"\n",
    "* pascal: \"ColumnName\"\n",
    "* const: \"COLUMN_NAME\"\n",
    "* sentence: \"Column name\"\n",
    "* title: \"Column Name\"\n",
    "* lower: \"column name\"\n",
    "* upper: \"COLUMN NAME\"\n",
    "\n",
    "After cleaning, a **report** is printed that provides the number and percentage of values that were cleaned (the value must be transformed).\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_headers()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example dirty dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame(\n",
    "        {\n",
    "            \"ISBN\": [9781455582341],\n",
    "            \"isbn\": [1455582328],\n",
    "            \"bookTitle\": [\"How Google Works\"],\n",
    "            \"__Author\": [\"Eric Schmidt, Jonathan Rosenberg\"],\n",
    "            \"Publication (year)\": [2014],\n",
    "            \"éditeur\": [\"Grand Central Publishing\"],\n",
    "            \"Number_Of_Pages\": [305],\n",
    "            \"★ Rating\": [4.06],\n",
    "        }\n",
    "    )\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default `clean_headers()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the `case` parameter is set to \"snake\" and the `remove_accents` parameter is set to True (strip accents and symbols from the column name)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_headers\n",
    "clean_headers(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that \"_1\" is appended to the second instance of the column name \"isbn\" to distinguish it from the first instance after the transformation. Consequently, all column names are considered to have been cleaned in this example.\n",
    "\n",
    "Column names that are duplicated as a result of calling `clean_headers()` are automatically renamed to append a number to the end. The suffix used to append the number is inferred from the `case` parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. `case` parameter\n",
    "\n",
    "This section demonstrates the supported case styles.\n",
    "\n",
    "### kebab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"kebab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### camel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"camel\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### pascal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"pascal\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### const"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"const\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### sentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"sentence\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"title\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### lower"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"lower\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### upper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, case=\"upper\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `replace` parameter\n",
    "\n",
    "The `replace` parameter takes in a dictionary of values in the column names to be replaced by new values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, replace={\"éditeur\": \"publisher\", \"★\": \"star\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `remove_accents` parameter\n",
    "\n",
    "By default, the `remove_accents` parameter is set to True (strip accents and symbols from the column names). If set to False, any accents or symbols are kept in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df, remove_accents=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Null headers\n",
    "\n",
    "Null column headers in the DataFrame are replaced with the default value \"header\". As with other column names, duplicated values are renamed with appended numbers. Null header values include `np.nan`, `None` and the empty string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"\": [9781455582341],\n",
    "                   np.nan: [\"How Google Works\"],\n",
    "                   None: [\"Eric Schmidt, Jonathan Rosenberg\"],\n",
    "                   \"N/A\": [2014],\n",
    "                  })\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_headers(df)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
